ABSTRACT

This paper presents a novel mechanism to adapt surrogate- 
assisted population-based algorithms. This mechanism is 
applied to ACM-ES, a recently proposed surrogate-assisted 
variant of CMA-ES. The resulting algorithm, s*ACM-ES,
adjusts online the lifelength of the current surrogate model 
(the number of CMA-ES generations before learning a new 
surrogate) and the surrogate hyper-parameters. 

Both heuristics significantly improve the quality of the
surrogate model, yielding a significant speed-up of s*ACM-ES
compared to the ACM-ES and CMA-ES baselines. The empirical
validation of s*ACM-ES on the BBOB-2012 noiseless testbed
demonstrates the efficiency and the scalability of the
proposed approach w.r.t. the problem dimension and the
population size; the approach reaches new best results on
some of the benchmark problems.

Categories and Subject Descriptors 

I.2.8 [Computing Methodologies]: Artificial Intelligence
Problem Solving, Control Methods, and Search 

General Terms 

Algorithms 

Keywords 

Evolution Strategies, CMA-ES, self-adaptation, surrogate- 
assisted black-box optimization, surrogate models, ranking 
support vector machine 

1. INTRODUCTION 

Evolutionary Algorithms (EAs) have become popular tools 
for optimization mostly thanks to their population-based 
properties and the ability to progress towards an optimum 
using problem-specific variation operators. A search directed 
by a population of candidate solutions is quite robust with 
respect to moderate noise and multi-modality of the optimized
function, in contrast to some classical optimization
methods such as quasi-Newton methods (e.g. BFGS method
[25]). Furthermore, many bio-inspired algorithms such as 
EAs, Differential Evolution (DE) and Particle Swarm Opti- 
mization (PSO) with rank-based selection are comparison- 
based algorithms, which makes their behavior invariant and 
robust under any monotonous transformation of the objec- 
tive function. Another source of robustness is the invariance 
under orthogonal transformations of the search space, first 
introduced into the realm of continuous evolutionary
optimization by Covariance Matrix Adaptation Evolution
Strategy (CMA-ES) [9]. CMA-ES, winner of the Congress on
Evolutionary Computation (CEC) 2005 [5] and the Black-Box
Optimization Benchmarking (BBOB) 2009 [8] competitions of
continuous optimizers, has also demonstrated its
robustness on real-world problems through more than one
hundred applications [Jj. 

When dealing with expensive optimization objectives, the 
well-known surrogate-assisted approaches proceed by learn- 
ing a surrogate model of the objective, and using this sur- 
rogate to reduce the number of computations of the objec- 
tive function in various ways. The best studied approach 
relies on the use of computationally cheap polynomial re- 
gression for the line search in gradient-based search meth- 
ods, such as in the BFGS method [25]. More recent ap- 
proaches rely on Machine Learning algorithms, modelling 
the objective function through e.g. Radial Basis Functions 
(RBF), Polynomial Regression, Support Vector Regression 
(SVR), Artificial Neural Network (ANN) and Gaussian Pro- 
cess (GP) a.k.a. Kriging. As could have been expected, 
there is no such thing as a best surrogate learning approach 
[14, 20]. Experimental comparisons also suffer from the fact
that the results depend on the tuning of the surrogate
hyper-parameters. Several approaches aimed at the adaptive
selection of surrogate models during the search have been
proposed [27, 28, 1]; these approaches focus on measuring the
quality of the surrogate models and using the best one for 
the next evolutionary generation. 

This paper, aimed at robust surrogate-assisted optimization,
presents a surrogate-adaptation mechanism (s*) which can be
used on top of any surrogate optimization approach. s* adapts
online the number of generations after which the surrogate is
re-trained, referred to as the surrogate lifelength; further,
it adaptively optimizes the surrogate hyper-parameters using
an embedded CMA-ES module. A proof of principle of the
approach is given by implementing s* on top of ACM-ES, a
surrogate-assisted variant of CMA-ES, yielding the s*ACM-ES
algorithm. To the best of our knowledge, the self-adaptation
of the surrogate model within CMA-ES and by CMA-ES is a new
contribution. The merits of the approach are shown as
s*ACM-ES shows significant improvements compared to CMA-ES
and ACM-ES on the BBOB-2012 noiseless testbed.

The paper is organized as follows. Section 2 reviews some
surrogate-assisted variants of Evolution Strategies (ESs) and
CMA-ES. For the sake of self-containedness, the ACM-ES
combining CMA-ES with the use of a Rank-based Support
Vector Machine is briefly described. Section 3 discusses
the merits and weaknesses of ACM-ES and suggests that
the online adjustment of the surrogate hyper-parameters is
required to reach some robustness with respect to the
optimization objective. The s*ACM-ES algorithm, including
the online adjustment of the surrogate lifelength and
hyper-parameters on top of ACM-ES, is described in section 4.
The experimental validation of s*ACM-ES is reported and
discussed in section 5. Section 6 concludes the paper.

2. SURROGATE MODELS 

This section discusses the various techniques used to learn
surrogate models, their use within EAs and specifically
CMA-ES, and the properties of surrogate models in terms of
invariance w.r.t. monotonous transformations of the objective
function [24], and orthogonal transformations of the instance
space [21].

2.1 Surrogate-assisted Evolution Strategies 

As already mentioned, many surrogate modelling approa- 
ches have been used within ESs and CMA-ES: RBF network 
P3], CP [21 g], ANN [6], SVR [301 EES], Local- Weighted 
Regression (LWR) [H [3], Ranking SVM [24] [21] [12]. In 
most cases, the surrogate model is used as a filter (to select
$\lambda_{Pre}$ promising pre-children) and/or to estimate the
objective function of some individuals in the current
population. The impact of the surrogate, controlled by
$\lambda_{Pre}$, should clearly depend on the surrogate
accuracy; how should this accuracy be measured? As shown by
[15], the standard Mean Square Error (MSE) used to measure a
model accuracy in a regression context is ill-suited to
surrogate-assisted optimization, as it is poorly correlated
with the ability to select correct individuals. Another
accuracy indicator, based on the (weighted) sum of ranks of
the selected individuals, was proposed by [15], and
used by [30] QT]. 



2.2 Comparison-based Surrogate Models 

Since some EAs, and particularly CMA-ES, are comparison-based
algorithms which only require the offspring to be correctly
ranked, it comes naturally to learn a comparison-based
surrogate. Comparison-based surrogate models, first
introduced by Runarsson [24], rely on learning-to-rank
algorithms, such as Ranking SVM [17]. Let us recall Ranking
SVM, assuming the reader's familiarity with Support Vector
Machines [26].

Let $(x_1, \ldots, x_N)$ denote an $N$-sample in instance
space $X$, assuming with no loss of generality that point
$x_i$ has rank $i$. Rank-based SVM learning [17] aims at a
real-valued function $\hat{f}$ on $X$ such that
$\hat{f}(x_i) < \hat{f}(x_j)$ iff $i < j$. In the SVM
framework, this goal is formalized through minimizing the norm
of $\hat{f}$ (regularization term) subject to the $N(N-1)/2$
ordering constraints. A more tractable formulation, also used
in [24, 21], only involves the $N-1$ constraints related to
consecutive points, $\hat{f}(x_i) < \hat{f}(x_{i+1})$ for
$i = 1 \ldots N-1$.



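Using the kernel trick, ranking function $\hat{f}$ is defined
as a linear function $w$ w.r.t. some feature space $\Phi(X)$,
i.e. $\hat{f}(x) = \langle w, \Phi(x) \rangle$. With same
notations as in [26], the primal minimization problem is
defined as follows:

$$
\begin{aligned}
\text{Minimize}_{\{w,\xi\}} \quad & \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N-1} C_i \xi_i \\
\text{subj. to} \quad & \langle w, \Phi(x_i) - \Phi(x_{i+1}) \rangle \geq 1 - \xi_i \quad (i = 1 \ldots N-1) \\
& \xi_i \geq 0 \quad (i = 1 \ldots N-1)
\end{aligned}
\qquad (1)
$$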

where slack variable $\xi_i$ (respectively constant $C_i$)
accounts for the violation of the $i$-th constraint (resp. the
violation cost). The corresponding dual problem, quadratic in
the Lagrangian multipliers $\alpha$, can be solved easily by
any quadratic programming solver. The rank-based surrogate
$\hat{f}$ is given as

$$\hat{f}(x) = \sum_{i=1}^{N-1} \alpha_i \left( K(x_i, x) - K(x_{i+1}, x) \right)$$

By construction, $\hat{f}(x)$ is invariant to monotonous
transformations of the objective function, which preserve the
ranking of the training points.
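
As an illustration of the formulation above, the following Python sketch solves the dual of the consecutive-pair problem by projected coordinate ascent and then evaluates the resulting surrogate. It is a minimal sketch under stated assumptions, not the solver used in ACM-ES: the function names (`rbf_kernel`, `train_ranking_svm`, `surrogate`), the plain Gaussian kernel, the number of sweeps and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def train_ranking_svm(X, C, sigma, n_sweeps=2000):
    """Projected coordinate ascent on the dual of the N-1 pair constraints.

    X is assumed to be sorted by rank (row 0 holds the point of rank 1);
    C holds the N-1 violation costs C_i. Returns the multipliers alpha.
    """
    K = rbf_kernel(X, X, sigma)
    # Q[i, j] = <Phi(x_i) - Phi(x_{i+1}), Phi(x_j) - Phi(x_{j+1})>
    Q = K[:-1, :-1] - K[:-1, 1:] - K[1:, :-1] + K[1:, 1:]
    alpha = np.zeros(len(X) - 1)
    for _ in range(n_sweeps):
        for i in range(len(alpha)):
            # Exact maximizer of the dual along coordinate i, clipped to [0, C_i].
            step = (1.0 - Q[i] @ alpha) / Q[i, i]
            alpha[i] = min(max(alpha[i] + step, 0.0), C[i])
    return alpha

def surrogate(x, X, alpha, sigma):
    """f_hat(x) = sum_i alpha_i * (K(x_i, x) - K(x_{i+1}, x))."""
    k = rbf_kernel(X, np.atleast_2d(x), sigma)[:, 0]
    return float(alpha @ (k[:-1] - k[1:]))

# Toy usage: six 1-D points, already ordered by rank.
X = np.arange(6.0).reshape(-1, 1)
alpha = train_ranking_svm(X, C=np.full(5, 1e6), sigma=1.0)
values = [surrogate(x, X, alpha, sigma=1.0) for x in X]
margins = [values[i] - values[i + 1] for i in range(5)]
```

With the constraints written as in the primal problem, each entry of `margins` is the quantity $\langle w, \Phi(x_i) - \Phi(x_{i+1}) \rangle$, so on this small, well-separated sample all margins should come out positive once the iteration has converged.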

2.3 Invariance w.r.t. Orthogonal Transformations

As already mentioned, CMA-ES is invariant w.r.t. orthogonal
transformations of the search space, through adapting a
covariance matrix during the search. This invariance property
was borrowed by ACM-ES [21], using the covariance matrix $C$
learned by CMA-ES within a Radial Basis Function (RBF) kernel
$K_C$, where $C^{-1}$ is used to compute Mahalanobis distance:

$$K_C(x_i, x_j) = \exp\left( -\frac{(x_i - x_j)^T C^{-1} (x_i - x_j)}{2\sigma^2} \right) \qquad (2)$$

For the sake of numerical stability, every training point $x$
is mapped onto $x'$ such that

$$x' = C^{-1/2} (x - m) \qquad (3)$$

where $m$ is the mean of the current CMA-ES distribution.
The standard RBF kernel with Euclidean distance is used
on top of this mapping. By construction, ACM-ES inherits
from CMA-ES the property of invariance under orthogonal
transformations of the search space; according to [21], the
use of the covariance matrix $C$ brings significant
improvements compared to standard Gaussian kernels.
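
The equivalence between the Mahalanobis-distance kernel and the Euclidean RBF kernel applied after the mapping $x' = C^{-1/2}(x - m)$ can be checked numerically. The sketch below is illustrative only: the helper names (`whiten`, `rbf_kernel`, `mahalanobis_rbf_kernel`), the Gaussian form with bandwidth `sigma` and the random test data are assumptions of the example, not code from ACM-ES.

```python
import numpy as np

def whiten(X, C, m):
    """Map each row x of X to x' = C^{-1/2} (x - m)."""
    eigval, eigvec = np.linalg.eigh(C)  # C is symmetric positive definite
    C_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return (X - m) @ C_inv_sqrt         # C_inv_sqrt is symmetric

def rbf_kernel(X, Y, sigma):
    """Standard RBF kernel with Euclidean distance."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mahalanobis_rbf_kernel(X, Y, C, sigma):
    """RBF kernel where the squared distance is (x - y)^T C^{-1} (x - y)."""
    diff = X[:, None, :] - Y[None, :, :]
    sq_dist = np.einsum("ijk,kl,ijl->ij", diff, np.linalg.inv(C), diff)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
C = B @ B.T + 3.0 * np.eye(3)   # an arbitrary positive definite matrix
m = rng.normal(size=3)
X = rng.normal(size=(5, 3))

K_direct = mahalanobis_rbf_kernel(X, X, C, sigma=1.0)
X_mapped = whiten(X, C, m)
K_mapped = rbf_kernel(X_mapped, X_mapped, sigma=1.0)
```

Both kernel matrices agree up to rounding, because the translation by `m` cancels in every difference and the squared Euclidean norm of $C^{-1/2}(x_i - x_j)$ is the squared Mahalanobis distance.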

3. DISCUSSION 

This section analyzes the weaknesses of ACM-ES. Following the
characterization proposed in [16], ACM-ES is a
surrogate-assisted optimizer with an individual-based
evolution control. As in other pre-selection methods, at each
generation ACM-ES generates $\lambda_{Pre}$ individuals, where
$\lambda_{Pre}$ is much larger than population size $\lambda$.
Then the $\lambda_{Pre}$ pre-children are evaluated and ranked
using surrogate model $\hat{f}$. The most promising
$\lambda'$ pre-children are selected and evaluated using
the true expensive function, yielding new points $(x, f(x))$.
When the objective function of $\lambda'$ individuals is
known, the ranking of the other $\lambda - \lambda'$ points
can be approximated.

1 The so-called kernel trick supports the extension of the
SVM approach from linear to non-linear model spaces, by
mapping instance space $X$ onto some feature space $\Phi(X)$.
The actual mapping cost is avoided as the scalar product
in feature space $\Phi(X)$ is computed on instance space $X$
through a kernel function $K$:
$\langle \Phi(x), \Phi(x') \rangle =_{def} K(x, x')$.




[Figure 1 plots; x-axes: Number of evaluations (left), Number of generations (right)]

Figure 1: Left: Rank-based surrogate error vs number of evaluations, during a representative run of active
CMA-ES on 10-D Rotated Ellipsoid. Right: The speedup of IPOP-aACM-ES over IPOP-aCMA-ES, where
speedup = 2.0 means that IPOP-aACM-ES with a given lifelength $n$ of the surrogate model requires 2.0
times less computational effort SP1 (i.e. average number of function evaluations of successful runs divided
by proportion of successful runs) than IPOP-aCMA-ES to reach the target objective value of $f_t = f_{opt} + 10^{-8}$.



While our experimental results confirm the improvements
brought by ACM-ES on some functions (about 2-4 times
faster than CMA-ES on Rosenbrock, Ellipsoid, Schwefel,
Noisy Sphere and Ackley functions up to dimension 20 [21]),
they also show a loss of performance on the multi-modal
Rastrigin function. Complementary experiments suggest that:

1. on highly multi-modal functions the surrogate model
happens to suffer from a loss of accuracy; in such cases
some control is required to prevent the surrogate model
from misleading the search;

2. surrogate-assisted algorithms may require a larger
population size for multi-modal problems.

The lack of surrogate control appears to be an important
drawback in ACM-ES. This control should naturally reflect
the current surrogate accuracy. A standard measure of the
rank-based surrogate error is given as the fraction of violated
ranking constraints on the test set [17]. An error of 0
(respectively 0.5) corresponds to a perfect surrogate (resp.
random guessing).

However, before optimization one should be sure that the
model gives a reasonable prediction of the optimized
function. Fig. 1 (Left) illustrates the surrogate model error
during a representative run of active CMA-ES (with re-training
at each iteration, but without any exploitation of the model)
on the 10-dimensional Rotated Ellipsoid function. After the
first generations, the surrogate error decreases to
approximately 10%. This better-than-random prediction can be
viewed as a source of information about the function which
can be used to improve the search.

Let $n$ denote the number of generations a surrogate model
is used, referred to as surrogate lifelength. In so-called
generation-based evolution control methods [16], the
surrogate $\hat{f}$ is directly optimized for $n$ generations,
without requiring any expensive objective computations. The
following generation considers the objective function $f$, and
yields instances to enrich the training set, relearn or refresh
the surrogate and adjust some parameters of the algorithm.
The surrogate lifelength $n$ is either fixed or adapted.

The impact of $n$ is displayed on Fig. 1 (Right), showing
the speedup reached by direct surrogate optimization on
several 10-dimensional benchmark problems vs the number
of generations $n$ the surrogate is used. A speedup factor of
1.7 is obtained for $n = 1$ on the Rotated Ellipsoid function,
close to the optimal speedup 2.0. A speedup ranging from
2 to 4 is obtained for IPOP-aCMA-ES with surrogates for
$n$ in [5, 15]. As could have been expected again, the optimal
value of $n$ is problem-dependent and widely varies. In the
case of the Attractive Sector problem for instance, the
surrogate model is not useful and $n = 0$ should be used (thus
falling back to the original aCMA-ES with no surrogate) to
prevent the surrogate from misleading the search.

4. SELF-ADAPTIVE SURROGATE-BASED CMA-ES

In this section we propose a novel surrogate adaptation
mechanism which can be used in principle on top of any
iterative population-based optimizer without requiring any
significant modifications thereof. The approach is illustrated
on top of CMA-ES and ACM-ES. The resulting algorithm,
s*ACM-ES, maintains a global hyper-parameter vector
$\theta = (\theta_{opt}, \theta_{sur}, n, \mathcal{A}, \alpha)$, where:

• $\theta_{opt}$ stands for the optimization parameters of the
CMA-ES used for expensive function optimization;

• $\theta_{sur}$ stands for the optimization parameters of the
CMA-ES used for surrogate model hyper-parameters optimization;

• $n$ is the number of optimization generations during which
the current surrogate model is used;

• $\mathcal{A}$ is the archive of all points $(x_i, f(x_i))$ for
which the true objective function has been computed, exploited
to train the surrogate function;

• $\alpha$ stands for the surrogate hyper-parameters.

All hyper-parameters are indexed by the current generation
$g$; by abuse of notation, the subscript $g$ is omitted when
clear from the context.

The two main contributions of the paper regard the
adjustment of the surrogate hyper-parameters (section 4.2)
and of the surrogate lifelength $n$ (section 4.3).

4.1 Overview of s*ACM-ES

Let GenCMA($h$, $\theta_h$, $\mathcal{A}$) denote the elementary
optimization module (here one generation of CMA-ES) where $h$
is the function to be optimized (the true objective $f$ or the
surrogate $\hat{f}$), $\theta_h$ denotes the current optimization parameters



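Algorithm 1 s*ACM-ES
 1: $g \leftarrow 0$; Err $\leftarrow 0.5$; $\mathcal{A}_g \leftarrow \emptyset$;
 2: $\theta_{opt} \leftarrow$ InitializationCMA(); { to optimize $f(x)$ }
 3: $\theta_{sur} \leftarrow$ InitializationCMA(); { to optimize $h(\alpha)$ }
 4: repeat
 5:     $\{\theta_{opt}, \mathcal{A}_{g+1}\} \leftarrow$ GenCMA($f$, $\theta_{opt}$, $\mathcal{A}_g$);
 6:     $g \leftarrow g + 1$;
 7: until $g = g_{start}$;
 8: repeat
 9:     $\hat{f}(x) \leftarrow$ BuildSurrogateModel($\alpha$, $\mathcal{A}_g$, $\theta_{opt}$);
10:     $g_{prev} \leftarrow g$;
11:     for $i = 1, \ldots, n$ do
12:         $\{\theta_{opt}, \mathcal{A}_{g+1} = \mathcal{A}_g\} \leftarrow$ GenCMA($\hat{f}$, $\theta_{opt}$, $\mathcal{A}_g$);
13:         $g \leftarrow g + 1$;
14:     $\{\theta_{opt}, \mathcal{A}_{g+1}\} \leftarrow$ GenCMA($f$, $\theta_{opt}$, $\mathcal{A}_g$);
15:     $g \leftarrow g + 1$;
16:     Err($\alpha$) $\leftarrow$ MeasureSurrogateError($\hat{f}$, $\theta_{opt}$);
17:     Err $\leftarrow (1 - \beta_{Err})$ Err $+ \beta_{Err}$ Err($\alpha$);
18:     $n \leftarrow \lfloor \frac{\tau_{err} - Err}{\tau_{err}} n_{max} \rfloor$;
19:     // adjust surrogate hyperparameters
20:     $\theta_{sur} \leftarrow$ GenCMA(Err, $\theta_{sur}$);
21:     $\alpha \leftarrow \theta_{sur}.m$;
22: until stopping criterion is met;

Figure 2: Optimization loop of the s*ACM-ES.
[Flowchart: build surrogate model $\hat{f}(x)$ of $f(x)$ using model
hyper-parameters $\alpha$ (Ranking SVM); optimize $\hat{f}(x)$ for $n$
generations (CMA-ES #1 in space $x$); optimize $f(x)$ for 1 generation
(CMA-ES #1 in space $x$); estimate model error Err($\alpha$) using the
last $\lambda$ evaluated points (fraction of misranking); adjust the
number of generations $n$; optionally, optimize Err($\alpha$) for 1
generation and set $\alpha$ to the new mean of the distribution
(CMA-ES #2 in space $\alpha$).]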



(e.g. CMA-ES step-size and covariance matrix) associated
to $h$, and $\mathcal{A}$ is the archive of $f$. After each call of
GenCMA, optimization parameters $\theta_h$ are updated; and if
GenCMA was called with the true objective function $f$, archive
$\mathcal{A}$ is updated and augmented with the new points
$(x, f(x))$. Note that GenCMA can be replaced by any black-box
optimization procedure, able to update its own optimization
parameters and the archive.

s*ACM-ES starts by calling GenCMA for $g_{start}$ generations
with the true objective $f$, where $\theta_{opt}$ and
$\mathcal{A}$ are respectively initialized to the default
parameters of CMA-ES and the empty set (lines 4-7). In this
starting phase, optimization parameter $\theta_{opt}$ and
archive $\mathcal{A}$ are updated in each generation.

Then s*ACM-ES iterates a three-step process (Algorithm 1,
illustrated on Fig. 2):

1. Learning surrogate $\hat{f}$ (procedure BuildSurrogateModel,
line 9, section 4.2);

2. Optimizing surrogate $\hat{f}$ during $n$ generations (lines
11-13). This step classically calls
GenCMA($\hat{f}$, $\theta_{opt}$, $\mathcal{A}$) for $n$ consecutive
generations; $\theta_{opt}$ is updated accordingly while
$\mathcal{A}$ is unchanged since this step does not involve
any computation of the expensive $f$.

3. Adjusting the surrogate lifelength $n$ (section 4.3).

4.2 Learning a Surrogate and Adjusting its Hyper-parameters

The surrogate model learning phase proceeds as in ACM-ES
(section 2.3). GenCMA($f$, $\theta_{opt}$, $\mathcal{A}$) is launched
for one generation with the true objective $f$, updating and
augmenting archive $\mathcal{A}$ with new $(x, f(x))$ points.

$\hat{f}$ is built using Ranking SVM [17] with archive
$\mathcal{A}$ as training set, where the SVM kernel is tailored
using the current optimization parameters $\theta_{opt}$ such as
covariance matrix $C$.



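Algorithm 2 Objective function Err($\alpha$) of surrogate model
1: Input: $\alpha$
2: $\hat{f}(x) \leftarrow$ BuildSurrogateModel($\alpha$, $\mathcal{A}_{g_{prev}}$, $\theta_{opt,g_{prev}}$);
3: Err($\alpha$) $\leftarrow$ MeasureSurrogateError($\hat{f}$, $\theta_{opt,g_{prev}}$);
4: Output: Err($\alpha$);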



The contribution regards the adjustment of the surrogate
hyper-parameters $\alpha$ (e.g. the number and selection of the
training points in $\mathcal{A}$; the weights of the constraint
violations in Ranking SVM, section 2.2), which are adjusted to
optimize the quality of the surrogate Err (Eq. 4). Formally,
to each surrogate hyper-parameter vector $\alpha$ is associated
a surrogate error Err($\alpha$) defined as follows:
hyper-parameter $\alpha$ is used to learn surrogate
$\hat{f}_\alpha$ using $\mathcal{A}_{g-1}$ as training set,
and Err($\alpha$) is set to the ranking error of
$\hat{f}_\alpha$, using the most recent points
($\mathcal{A}_g - \mathcal{A}_{g-1}$) as test set.

Letting $\Lambda$ denote the test set and assuming with no loss
of generality that the points in $\Lambda$ are ordered after $f$,
Err($\hat{f}_\alpha$) is measured as follows (procedure
MeasureSurrogateError, line 16):

$$Err(\hat{f}_\alpha) = \frac{2}{|\Lambda|(|\Lambda|-1)} \sum_{i=1}^{|\Lambda|} \sum_{j=i+1}^{|\Lambda|} w_{ij} \cdot 1_{\hat{f}_\alpha,i,j} \qquad (4)$$

where $1_{\hat{f}_\alpha,i,j}$ holds true iff $\hat{f}_\alpha$
violates the ordering constraint on pair $(i,j)$. In all
generality, the surrogate error can be tuned using weight
coefficients $w_{ij}$ to reflect the relative importance of
ordering constraints. Only $w_{ij} = 1$ will be used in the
remainder of the paper. For a better numerical stability, the
surrogate error is updated using additive relaxation, with
relaxation constant $\beta_{Err}$ (lines 16-17).
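
The ranking error described above, i.e. the fraction of pairwise ordering constraints violated by the surrogate with all weights equal to 1, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `surrogate_rank_error` is hypothetical, lower values are assumed to be better for both the objective and the surrogate, and ties in the surrogate are (arbitrarily) not counted as violations.

```python
def surrogate_rank_error(true_values, surrogate_values):
    """Fraction of pairs ordered by f that the surrogate ranks in reverse."""
    n = len(true_values)
    if n < 2:
        raise ValueError("at least two test points are needed")
    # Indices of the test points sorted by increasing true objective value.
    order = sorted(range(n), key=lambda k: true_values[k])
    violated = 0
    for a in range(n):
        for b in range(a + 1, n):
            i, j = order[a], order[b]
            # Point i is at least as good as point j under f; the pair is
            # violated when the surrogate strictly prefers j.
            if surrogate_values[i] > surrogate_values[j]:
                violated += 1
    return 2.0 * violated / (n * (n - 1))

perfect = surrogate_rank_error([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
swapped = surrogate_rank_error([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
inverted = surrogate_rank_error([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A surrogate that preserves the ranking gives an error of 0, one misordered pair out of three gives 1/3, and a fully inverted ranking gives 1; a random ranking is expected to give about 0.5.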

Finally, the elementary optimization module
GenCMA(Err, $\theta_{sur}$) (in this study the archive parameter
is not used here) is launched for one generation (line 20), and
the mean of the CMA-ES mutation distribution is used (line
21) as surrogate hyper-parameter vector in the next surrogate
building phase (line 9).

Figure 3: Number of generations $n$ versus surrogate error Err.
Linear interpolation (bold curve) has been used in the
experimental validation.

4.3 Adjusting Surrogate Lifelength 

Lifelength $n_g$ is likewise adjusted depending on the error
made by the previous surrogate $\hat{f}_{g-1}$ on the new
archive points ($\mathcal{A}_g - \mathcal{A}_{g-1}$). If this
error is 0, then $\hat{f}_{g-1}$ is perfectly accurate and could
have been used for some more generations before learning
$\hat{f}_g$. In this case lifelength $n_g$ is set to the maximum
value $n_{max}$, which corresponds to the maximum theoretical
speedup of the s*ACM-ES.
If the error is circa 0.5, surrogate $\hat{f}_{g-1}$ provides no
better indications than random guessing and thus misleads the
optimization; $n_g$ is set to 0. More generally, considering an
error threshold $\tau_{err}$, $n$ is adjusted between $n_{max}$
and 0, proportionally to the ratio between the actual error and
the error threshold $\tau_{err}$ (line 18, bold curve on Fig. 3).
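
The lifelength update can be sketched as follows, assuming the linear interpolation rule $n = \lfloor n_{max} (\tau_{err} - Err) / \tau_{err} \rfloor$ clipped at 0, together with the additive relaxation of the error. The function names are hypothetical; the default values for the maximum lifelength (20), the error threshold (0.45) and the relaxation factor (0.2) are the settings reported in the experimental section.

```python
import math

def update_error(err, err_alpha, beta_err=0.2):
    """Additive relaxation of the surrogate error with factor beta_err."""
    return (1.0 - beta_err) * err + beta_err * err_alpha

def surrogate_lifelength(err, n_max=20, tau_err=0.45):
    """Linear interpolation between n_max (err = 0) and 0 (err >= tau_err)."""
    if err >= tau_err:
        return 0  # the surrogate is considered no better than random guessing
    return math.floor((tau_err - err) / tau_err * n_max)

n_perfect = surrogate_lifelength(0.0)     # perfectly accurate surrogate
n_half = surrogate_lifelength(0.225)      # half of the error threshold
n_random = surrogate_lifelength(0.5)      # random guessing
smoothed = update_error(0.5, 0.0)         # initial Err = 0.5, then a perfect model
```

With these settings a perfect surrogate is kept for 20 generations, an error at half the threshold gives 10 generations, and any error at or above the threshold disables the surrogate.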

5. EXPERIMENTAL VALIDATION 

The experimental validation of the approach proceeds by
comparing the performance of s*ACM-ES to the original [9]
and active [10] CMA-ES versions, considering the restart
scenario with increasing population size (IPOP [2, 10]).

The active IPOP-aCMA-ES [10] with weighted negative
covariance matrix update is found to perform equally well or
better than IPOP-CMA-ES, which is explained as it more
efficiently exploits the information of the worst $\lambda/2$
points. We use IPOP-aCMA-ES as a challenging baseline, more
difficult to speed up than the original IPOP-CMA-ES.

Specifically, s*ACM-ES is validated on the noiseless BBOB
testbed by comparing IPOP-aACM-ES with fixed
hyper-parameters, and IPOP-s*aACM-ES with online adaptation
of hyper-parameters of the surrogate model.

After detailing the experimental setting, this section
reports on the offline tuning of the number $N_{training}$ of
points used to learn the surrogate model, and the online
tuning of the surrogate hyper-parameters.

2 Complementary experiments, omitted for brevity, show that
the best adjustment of $n$ depending on the surrogate error
is again problem-dependent.

3 For the sake of reproducibility we used the Octave/MatLab
source code of IPOP-CMA-ES with default parameters,
available from its author's page, with the active flag set to
1. The s*ACM-ES source code is available at
https://sites.google.com/site/acmesgecco/

Parameter         Range for online tuning                     Offline tuned value
$N_{training}$    $[4d, 2(40 + \lfloor 4d^{1.7} \rfloor)]$    $40 + \lfloor 4d^{1.7} \rfloor$
$C_{base}$        [0, 10]                                     6
$C_{pow}$         [0, 6]                                      3
$C_{sigma}$       [0.5, 2]                                    1

Table 1: Surrogate hyper-parameters, default value
and range of variation

5.1 Experimental Setting 

The default BBOB stopping criterion is reaching target
function value $f_t = f_{opt} + 10^{-8}$. Ranking SVM is trained
using the most recent $N_{training}$ points (subsection 5.2);
its stopping criterion is arbitrarily set to a maximum number
of $1000 N_{training}$ iterations of the quadratic programming
solver.

After a few preliminary experiments, the Ranking SVM
constraint violation weights (Eq. 1) are set to

$$C_i = 10^{C_{base}} (N_{training} - i)^{C_{pow}}$$

with $C_{base} = 6$ and $C_{pow} = 3$ by default; the cost of
constraint violation is thus cubically higher for top-ranked
samples. The $\sigma$ parameter of the RBF kernel is set to
$\sigma = C_{sigma} \sigma_x$, where $\sigma_x$ is the dispersion
of the training points (their average distance after
translation, Eq. 3) and $C_{sigma}$ is set to 1 by default. The
number $g_{start}$ of CMA-ES calls in the initial phase is set
to 10, the maximum lifelength $n_{max}$ of a surrogate model is
set to 20. The error threshold $\tau_{err}$ is set to 0.45 and
the error relaxation factor $\beta_{Err}$ is set to 0.2.
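
To make the effect of the constraint violation weights concrete, the sketch below computes them under the assumption that they take the form $C_i = 10^{C_{base}} (N_{training} - i)^{C_{pow}}$ for $i = 1 \ldots N_{training} - 1$, with the default values 6 and 3 quoted above. The function name `violation_weights` is hypothetical and the snippet is only an illustration of the formula.

```python
def violation_weights(n_training, c_base=6.0, c_pow=3.0):
    """Weights C_i = 10**c_base * (n_training - i)**c_pow, i = 1 .. n_training - 1."""
    return [10.0 ** c_base * (n_training - i) ** c_pow
            for i in range(1, n_training)]

# Small example with c_base = 0 so that the polynomial part is visible.
small = violation_weights(3, c_base=0.0, c_pow=3.0)

# Default setting on a training set of 240 points.
weights = violation_weights(240)
ratio = weights[0] / weights[-1]   # top-ranked pair vs. last-ranked pair
```

With three training points and a cubic power the two weights are 8 and 1; with 240 points the constraint on the top-ranked pair is weighted 239 cubed times more than the constraint on the last pair.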

The surrogate hyper-parameters $\theta_{sur}$ are summarized in
Table 1 with their offline tuned value (default for
IPOP-aACM-ES) and their range of variation for online tuning
(where $d$ stands for the problem dimension). Surrogate
hyper-parameters are optimized with a population size 20 (20
surrogate models), where the Err function associated to a
hyper-parameter vector is measured on the most recent
$\lambda$ points in archive $\mathcal{A}$, with $\lambda$ the
current optimization population size.

5.2 Offline Tuning: Number of Training Points 

It is widely acknowledged that the selection of the training
set is an essential ingredient of surrogate learning [16].
After some alternative experiments omitted for brevity, the
training set includes simply the most recent $N_{training}$
points in the archive. The study thus focuses on the tuning of
$N_{training}$. Its optimal tuning is of course problem- and
surrogate learning algorithm-dependent. Several tunings have
been considered in the literature, for instance for
10-dimensional problems: $3\lambda$ for SVR [30]; 30 for RBF
[29]; 50 for ANN [6]; $\lambda$, $2\lambda$ for Ranking SVM
[2J [TOJ; $\frac{d(d+3)}{2} + 1 = 66$ for LWR in the
lmm-CMA-ES [3]; $70\sqrt{d} = 221$ for Ranking SVM in the
ACM-ES [21].

In all above cases but ACM-ES, the surrogate models aim
at local approximation. These approaches might thus be
biased toward small $N_{training}$ values, as a small number of
training points are required to yield good local models (e.g.
in the case of the Sphere function), and small $N_{training}$
values positively contribute to the speed-up. It is suggested
however that the Sphere function might be misleading,
regarding the optimal adjustment of $N_{training}$.

Figure 4: The speedup of IPOP-aACM-ES over
IPOP-aCMA-ES w.r.t. (fixed) number of training
points.

Figure 5: The median trajectories of normalized surrogate
hyper-parameters estimated on 15 runs of the
IPOP-s*aACM-ES on Rotated Ellipsoid 20-D.

Let us consider the surrogate speed-up of IPOP-aACM-ES
w.r.t. IPOP-CMA-ES depending on (fixed) $N_{training}$, on
uni-modal benchmark problems from the BBOB noiseless
testbed (Fig. 4 for $d = 10$). While the optimal speed-up
varies from 2 to 4, the actual speed-up strongly depends on
the number $N_{training}$ of training points.

Complementary experiments on $d$-dimensional problems
with $d = 2, 5, 10, 20, 40$ (Fig. 4) lead us to propose an
average best tuning of $N_{training}$ depending on dimension $d$:

$$N_{training} = \lfloor 40 + 4 d^{1.7} \rfloor \qquad (5)$$

Eq. (5) is found to empirically outperform the one proposed
in [21] ($N_{training} = \lfloor 70\sqrt{d} \rfloor$), which
appears to be biased to 10-dimensional problems, and
underestimates the number of training points required in
higher dimensions. Experimentally however, $N_{training}$ must
super-linearly increase with $d$; Eq. (5) states that for large
$d$ the number of training points should triple when $d$ doubles.
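
The two tuning rules compared above can be evaluated side by side. The sketch below assumes the forms $\lfloor 40 + 4d^{1.7} \rfloor$ for the rule proposed here and $\lfloor 70\sqrt{d} \rfloor$ for the rule of [21]; the function names are hypothetical.

```python
import math

def n_training_proposed(d):
    """Proposed rule: floor(40 + 4 * d**1.7)."""
    return math.floor(40 + 4 * d ** 1.7)

def n_training_acm_es(d):
    """Rule attributed to ACM-ES [21]: floor(70 * sqrt(d))."""
    return math.floor(70 * math.sqrt(d))

# Print both training set sizes for the dimensions considered in the text.
for d in (2, 5, 10, 20, 40):
    print(d, n_training_proposed(d), n_training_acm_es(d))

# Asymptotic growth factor of the proposed rule when d doubles.
growth = 2 ** 1.7
```

For $d = 10$ the two rules are close (240 against 221), which is consistent with the remark that the rule of [21] is biased to 10-dimensional problems; the growth factor $2^{1.7} \approx 3.25$ is the "triple when $d$ doubles" behaviour mentioned in the text.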

Further, Fig. 4 shows that the optimal $N_{training}$ value is
significantly smaller for the Sphere function than for other 
functions, which experimentally supports our conjecture that 
the Sphere function might be misleading with regard to the 
tuning of surrogate hyper-parameters. 

5.3 Online Tuning: Surrogate Hyper-parameters 

The IPOP-s*ACM-ES achieves the online adaptation of
the surrogate hyper-parameters within a specified range
(Table 1), yielding the surrogate hyper-parameter values to be
used in the next surrogate learning step.

Note that a surrogate hyper-parameter individual might
be non-viable, i.e. it might not make it possible to learn a
surrogate model (Ranking SVM fails due to an ill-conditioned
kernel). Such a non-viable individual is heavily penalized
(Err($\alpha$) >> 1). In case no usable hyper-parameter
individual is found (which might happen in the very early
generations, as shown on Fig. 5), $\theta_{sur}$ is set to its
default value.

The online adaptation of surrogate hyper-parameters however
soon reaches usable hyper-parameter values. The trajectory of
the surrogate hyper-parameter values vs the number of
generations is depicted in Fig. 5, normalized in [0, 1]
and considering the median out of 15 runs optimizing the
20-dimensional Rotated Ellipsoid function.

The trajectory of $N_{training}$ displays three stages. In a
first stage, $N_{training}$ increases as the overall number of
evaluated points (all points are required to build a good
surrogate). In a second stage, $N_{training}$ reaches a plateau;
its value is close to the one found by offline tuning (section
5.2). In a third stage, $N_{training}$ steadily decreases. This
last stage is explained as CMA-ES approaches the optimum of $f$
and gets a good estimate of the covariance matrix of the
Ellipsoid function. At this point the optimization problem is
close to the Sphere function, and a good surrogate can be
learned from comparatively few training points.

The trajectories of other surrogate hyper-parameters are
more difficult to interpret, although they clearly show
non-random patterns (e.g. $C_{pow}$).

5.4 Comparative Performances 

The comparative performance of s*ACM-ES combined with
the original and the active variants of IPOP-CMA-ES is
depicted on Fig. 6 on the 10-d Rotated Ellipsoid (Left)
and Rosenbrock (Right) functions. In both cases, the online
adaptation of the surrogate hyper-parameters yields a quasi
constant speed-up, witnessing the robustness of s*ACM-ES.
On the Ellipsoid function, the adaptation of the covariance
matrix is much faster than for the baseline, yielding the same
convergence speed as for the Sphere function. On the
Rosenbrock function the adaptation is also much faster,
although there is clearly room for improvement.

The performance gain of s*ACM-ES, explained from the
online adjustment of the surrogate hyper-parameters, in
particular $N_{training}$, confirms the fact that the
appropriate surrogate hyper-parameters vary along search, and
can be adjusted based on the accuracy of the current surrogate
model. Notably, IPOP-s*ACM-ES almost always outperforms
IPOP-ACM-ES, especially for $d > 10$.

5.5 Scalability w.r.t Population Size 

The default population size $\lambda_{default}$ is suggested to
be the only CMA-ES parameter to possibly require manual tuning.
Actually, $\lambda_{default}$ is well tuned for uni-modal
problems and only depends on the problem dimension. Increasing
the population size does not decrease the overall number of
function evaluations needed to reach an optimum in general.
Still, it allows one to reach the optimum after fewer
generations. Increasing the population size and running the
objective function computations in parallel is a source of
speed-up, which raises the question of s*ACM-ES scalability
w.r.t. the population size.

Figure 6: Comparison of the proposed surrogate-assisted versions of the original and active IPOP-CMA-ES
algorithms on 10 dimensional Rotated Ellipsoid (Left) and Rosenbrock (Right) functions. The trajectories
show the median of 15 runs.

Figure 7: Speedup of the IPOP-s*aACM-ES over
IPOP-aCMA-ES for large population sizes
$\lambda = \gamma \lambda_{default}$ on 10-D problems.

Fig. 7 shows the speedup of the IPOP-s*aACM-ES compared to
IPOP-aCMA-ES for unimodal 10-dimensional problems, when the
population size $\lambda$ is set to $\gamma$ times the default
population size $\lambda_{default}$. For F8 Rosenbrock, F12
Cigar and F14 Sum of Different Powers the speedup remains
almost constant and independent of $\gamma$, while for F10
Rotated Ellipsoid, F11 Discus and F13 Sharp Ridge, it even
increases with $\gamma$. We believe that with a larger
population size, "younger" points are used to build the
surrogate model, which is hence more accurate.

The experimental evidence suggests that s*ACM-ES can
be applied on top of parallelized versions of
IPOP-s*aACM-ES, while preserving or even improving its
speed-up. Note that the same does not hold true for all
surrogate-assisted methods; for instance in trust region
methods, one needs to sequentially evaluate the points.

It is thus conjectured that further improvements of CMA-ES
(e.g. refined parameter tuning, noise handling) will
translate to s*ACM-ES, without degrading its speed-up.



6. CONCLUSION AND PERSPECTIVES 

This paper presents a generic framework for adaptive
surrogate-assisted optimization, which can in principle be
combined with any iterative population-based optimization
algorithm and any surrogate learning algorithm. This
framework has been instantiated on top of surrogate-assisted
ACM-ES, using CMA-ES as optimization algorithm and Ranking
SVM as surrogate learning algorithm. The resulting algorithm,
s*ACM-ES, inherits from CMA-ES and ACM-ES the property of
invariance w.r.t. monotonous transformations of the
objective function and orthogonal transformations of the
search space.

The main contribution of the paper regards the online 
adjustment of i) the number n of generations a surrogate 
model is used, called surrogate lifelength; ii) the surrogate 
hyper-parameters controlling the surrogate learning phase. 
The surrogate lifelength is adapted depending on the qual- 
ity of the current surrogate model; the higher the quality, 
the longer the next surrogate model will be used. The ad- 
justment of the surrogate hyper-parameters is likewise han- 
dled by optimizing them w.r.t. the quality of the surrogate 
model, without requiring any prior knowledge on the opti- 
mization problem at hand. 

IPOP-s*aACM-ES was found to improve on IPOP-aCMA-ES
with a speed-up ranging from 2 to 3 on uni-modal
d-dimensional functions from the BBOB-2012 noiseless testbed,
with dimension d ranging from 2 to 40. On multi-modal
functions, IPOP-s*aACM-ES is equally good or sometimes
better than IPOP-aCMA-ES, although the speed-up is less
significant than for uni-modal problems. Further,
IPOP-s*aACM-ES also improves on IPOP-aCMA-ES on problems
with moderate noise from the BBOB-2012 noisy testbed. All
these results as well as the computational complexity of the
algorithm are discussed in detail in BBOB-2012 workshop
papers [25] and [23].

A long term perspective for further research is to better
handle multi-modal and noisy functions. A shorter-term
perspective is to consider a more comprehensive surrogate
learning phase, involving a portfolio of learning algorithms
and using the surrogate hyper-parameter optimization phase
to achieve portfolio selection. Another perspective is to
design a tighter coupling of the surrogate learning phase and
the CMA-ES optimization, e.g. using the surrogate model
$\hat{f}$ to adapt the CMA-ES hyper-parameters during the
optimization of the expensive objective $f$.